Multilingual Topic Detection Using a Parallel Corpus

Authors

  • Wai Lam
  • Helen M. Meng
Abstract

We have developed an approach for topic detection from multilingual news, in particular Chinese and English. We extract named entities such as personal names, geographical location names, and organization names automatically from the news content using transformation-based linguistic taggers. These sets of named entities, together with the remaining content terms, form the basis of the news representation. Gross translation of the Chinese story representations into English is conducted using easily available resources. We have investigated two approaches to gross translation. The first is a basic method that uses only a bilingual dictionary. The second makes use of a parallel corpus as an additional resource. The topic discovery task uses a modified agglomerative clustering algorithm to group stories. One difference between our clustering approach and the standard agglomerative one is that we maintain three kinds of elements in the clustering process, namely stories, temporary clusters, and final clusters.
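The abstract gives no algorithmic detail, so the following is only a minimal sketch of the two ideas it names: dictionary-based gross translation and agglomerative grouping over stories, temporary clusters, and final clusters. Everything concrete here is an assumption made for illustration and is not taken from the paper: the toy dictionary `TOY_DICT`, the function names, cosine similarity over term counts, the similarity threshold, the processing of stories in successive windows, and the rule that temporary clusters become final only once all windows have been processed. The parallel-corpus variant of gross translation is not sketched.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy bilingual dictionary (Chinese term -> English candidates).
TOY_DICT = {
    "地震": ["earthquake"],
    "台湾": ["Taiwan"],
    "选举": ["election", "vote"],
    "总统": ["president"],
}

def gross_translate(terms, dictionary):
    """Replace each term by all of its dictionary translations (lowercased).
    Terms missing from the dictionary are dropped."""
    out = []
    for term in terms:
        out.extend(w.lower() for w in dictionary.get(term, []))
    return out

def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def cluster_window(stories, temp_clusters, threshold):
    """Merge new stories and existing temporary clusters agglomeratively.
    Each element is a pair (term Counter, list of story ids). The most
    similar pair is merged repeatedly until no pair reaches the threshold."""
    elements = list(temp_clusters)
    elements += [(Counter(terms), [sid]) for sid, terms in stories]
    while len(elements) > 1:
        best, pair = threshold, None
        for i in range(len(elements)):
            for j in range(i + 1, len(elements)):
                sim = cosine(elements[i][0], elements[j][0])
                if sim >= best:
                    best, pair = sim, (i, j)
        if pair is None:
            break
        i, j = pair
        merged = (elements[i][0] + elements[j][0],
                  elements[i][1] + elements[j][1])
        elements = [e for k, e in enumerate(elements) if k not in pair]
        elements.append(merged)
    return elements

def detect_topics(windows, threshold=0.3):
    """Process story windows in order, carrying temporary clusters forward.
    Assumption: whatever remains at the end is promoted to final clusters."""
    temp = []
    for stories in windows:
        temp = cluster_window(stories, temp, threshold)
    return [sorted(ids) for _, ids in temp]
```

To try it, pass `detect_topics` a list of windows, each a list of `(story_id, terms)` pairs, with Chinese stories first mapped through `gross_translate` so that both languages share one English term space.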

Similar resources

Using Multilingual Topic Models for Improved Alignment in English-Hindi MT

Parallel corpora are often injected with bilingual dictionaries for improved Indian language machine translation (MT). In absence of such dictionaries, a coarse dictionary may be required. This paper demonstrates the use of a multilingual topic model for creating coarse dictionaries for English-Hindi MT. We compare our approaches with: (a) a baseline with no additional dictionary injection, and...

Multilingual Relevant Sentence Detection Using Reference Corpus

IR with reference corpus is one approach when dealing with relevant sentences detection, which takes the result of IR as the representation of query (sentence). Lack of information and language difference are two major issues in relevant detection among multilingual sentences. This paper refers to a parallel corpus for information expansion and translation, and introduces different representati...

That'll Do Fine!: A Coarse Lexical Resource for English-Hindi MT, Using Polylingual Topic Models

Parallel corpora are often injected with bilingual lexical resources for improved Indian language machine translation (MT). In absence of such lexical resources, multilingual topic models have been used to create coarse lexical resources in the past, using a Cartesian product approach. Our results show that for morphologically rich languages like Hindi, the Cartesian product approach is detrime...

Discovering Parallel Text from the World Wide Web

Parallel corpus is a rich linguistic resource for various multilingual text management tasks, including crosslingual text retrieval, multilingual computational linguistics and multilingual text mining. Constructing a parallel corpus requires effective alignment of parallel documents. In this paper, we develop a parallel page identification system for identifying and aligning parallel documents ...

Automatic Dictionary Construction and Identification of Parallel Text Pairs

When creating dictionaries for use in for example cross-language search engines, parallel or comparable text pairs are needed. Multilingual web sites may contain parallel texts but these can be difficult to detect. For instance, a multilingual website, Hallå Norden, contains information in five languages; Swedish, Danish, Norwegian, Icelandic and Finnish. Working with these texts we discovered ...

Publication date: 2000